In this paper we examine inefficiencies and information disparity in the Japanese stock market.
By carefully analysing information publicly available on the internet, an 'outsider' to conventional 
statistical arbitrage strategies — which are based on market microstructure, company releases, or 
analyst reports — can nevertheless pursue a profitable trading strategy. A large volume of blog data 
is used to demonstrate the existence of an inefficiency in the market. An information-based model 
that replicates the trading strategy is developed to estimate the degree of information disparity. 



1. Introduction. Since the dawn of history, informa- 
tion has always been generated locally; it then spreads 
globally by various means, often being lost and some- 
times being rediscovered. Nothing has fundamentally 
changed with the advent of the internet. Here again, in- 
formation is generated locally on individual web sites and 
then, due to the potency of content and presentation, as 
well as the vagaries of place and timing, disappears into 
some data repository, or is picked up and amplified, cre- 
ating avalanche effects. Nowadays information is often 
posted initially on blogs and Twitter accounts, or discussed
on bulletin boards, and only subsequently, with
some delay, reaches the traditional media as represented 
by newspapers and television. This dissemination from a 
small to a wider circle of viewers is also of interest in the 
financial market context, because as knowledge spreads, 
it starts influencing investment decisions. We demon- 
strate in this paper that by capturing these trends at an 
early stage of information diffusion in a systematic and 
quantitative manner it is possible to construct a supe- 
rior trading strategy, thus establishing the existence of 
market inefficiencies. 

A closely related issue to information extraction in fi- 
nancial markets is the valuation of information. Suppose 
one is in possession of a piece of information, deemed 
valuable, that one wishes to monetise. How does one
price information, when information is viewed as a trad- 
able asset? For instance, consider the information that 
the price of a given stock will move up the following day 
with 75% likelihood. Leaving aside issues to do with 
insider trading for the moment, if one was to 'sell' this 
piece of information, how should one set a fair price? Ev- 
idently, in this example the price depends on a number of 
market factors, such as market impact. It also depends 
crucially on whether this information provision is a one- 
off event or whether such information will be supplied on 
a regular basis. All these issues make it virtually impossi- 
ble to arrive at the notion of a 'fair price' of information. 
It is nevertheless possible to associate a rate of return 
with the use of information, as we shall show here. 

In the efficient market theory 'all' the publicly available
information is incorporated in the price by the marginal
investor. This bold statement, which comes in various 
forms, has often been criticised in the literature (e.g., 
Grossman and Stiglitz 1980). Often there is an abun- 
dance of valuable information that is widely accessible to 
the whole market but from which not everyone has the 
resources or analytic capability to extract useful signals. 
Indeed, not even the so-called 'marginal' investor appears 
to exploit this additional data. The important point is 
that the distribution of information is never homogeneous 
because the ability to extract something useful is inho- 
mogeneous across different market agents. 

To establish a relationship between information and 
investment return, we must first identify what is meant 
by information. In financial markets information con- 
sists of two parts: signal and noise. By 'signal' we mean 
components of information that are dependent on the ac- 
tual return of, say, an investment; whereas by 'noise' we 
mean components of information that are statistically in- 
dependent of the actual return of that investment. Both 
components have direct impact on price dynamics, but it 
is ultimately the signal component that determines the 
realised value of the return. Thus, given this noisy in- 
formation, market participants try their best to estimate 
the signal; this estimate (in a suitably defined sense dis- 
cussed below) in turn determines the random dynamics 
of the associated price process. 

In many cases signal and noise are superimposed in 
an additive fashion. In other words, there are essentially 
two unknowns, 'signal' and 'noise', and one known, 'sig- 
nal plus noise'. The rate at which the signal is revealed to 
the market then determines the signal-to-noise ratio. The 
kind of information inhomogeneity discussed above there- 
fore arises primarily from the fact that different agents 
have different signal-to-noise ratios. With further refine- 
ments, however, one finds that signal-to-noise ratio is it- 
self rarely known in financial markets, i.e. it is what 
one might call a known unknown. Yet, it is the signal- 
to-noise ratio that directly affects the performance of an 
investment. Hence we can determine the ratios of the
signal-to-noise ratios of different agents from their
performances. This is one objective of the present paper.
We examine the ratio of two signal-to-noise ratios: one
for the market as a whole, and one for an internet-search 
based strategy. 

Our choice of an internet-search based strategy
as a comparison against the market should be evident:
most information circulates via the internet. Unlike tra- 
ditional investment firms, large internet search engines, 
by their very nature and in spite of being 'outsiders' to 
financial markets, are well positioned to extract signals 
from large data sets. From the viewpoint of internet 
search engines, the kind of analysis discussed here also 
has a profound implication. One of the key difficulties in 
the business of information provision is in the quantita- 
tive assessment of the validity and quality of the search 
engines or other recommendation tools. However, we now 
recognise that financial market dynamics provide a suit- 
able testing ground, and one with rapid feedback. For 
example, a "celebrity popularity engine" offered by in- 
ternet companies, useful to advertisers, can be applied 
to individual companies; the quality of the engine, which 
otherwise would have been difficult to assess, can now 
be tested instantly against the future movements of the 
corresponding stock prices. 

We have therefore taken a large number of blog articles 
from the internet, applied natural language processing 
(NLP) to convert numerous texts into numerical senti- 
ment indices for individual listed companies, and then 
developed a trading strategy that converts the sentiment 
indices into portfolio positions. The results show the ex- 
istence of an astonishing inefficiency in a highly liquid 
equity market. We also construct a theoretical model, 
within the information-based asset pricing framework of 
Brody-Hughston-Macrina (BHM), for the characterisa- 
tion of the strategy. The model has the advantage that 
the ratio of the signal-to-noise ratios between the in- 
formed outsider and the general market can be estimated 
from the investment performance. 

2. Information and asset price. To understand the 
interplay between information and asset price, we must 
first step back from the conventional approach in quanti- 
tative finance, and begin by identifying the main sources 
for price movements at a phenomenological level. After 
a little reflection it should not be difficult to identify two 
important factors, namely, risk preference and available 
information. To understand these two factors we list two 
different scenarios: (i) I would have bought the new Toy- 
ota car, had I not lost my job; (ii) I would have bought 
the new Toyota car, had I not read the news of the recall.
In case (i) the assessment of the worthiness of the
product has not changed, but the purchase decision has 
nevertheless been affected by the changes in one's ap- 
petite toward risk; whereas in case (ii) the assessment 
of the worthiness of the product has changed due to the 
arrival of new information. 



It is often argued that the price dynamics is generated 
by supply and demand; this is indeed so, but it has to be 
noted that a large part of supply and demand in financial 
markets is induced by the arrival of information (for ex- 
ample, an announcement of a substantial profit leading 
to high demand for company shares). We thus take the 
view that the traditional 'supply and demand' argument 
is in fact mostly the symptom and not the cause, at least 
in the case of highly liquid financial instruments. 

As regards changes in risk preference, at the individual 
level this can be relatively volatile, but averaged over the 
market the volatility will be reduced. On the other hand, 
the flow of information is significantly more dynamic and 
volatile. It is common for a dynamical system to depend 
on fast moving and slowly moving variables; in the case 
of a financial market, information is the fast moving and 
risk preference is the slowly moving variable. For our 
strategy, the changes in overall risk preference have little 
impact, because we only test market neutral strategies
that have no exposure to the overall risk preference of 
the market. Therefore, our first simplifying assumption 
is to regard market risk preference as fixed, and focus 
attention on the structure of information. Phrased in 
more technical terms, we will assume that the pricing 
measure is given once and for all, and we shall construct 
the market filtration from the outset, which will be used 
to derive the price process. This is in line with the BHM 
approach introduced in Brody et al. (2007, 2008), which 
will now be reviewed briefly. 

Consider an elementary asset that pays a single dividend
X at time T (e.g., a credit-risky discount bond).
We assume that there is an established pricing measure 
Q, under which the random cash flow X has the a priori 
density p(x). In this case, market participants are con- 
cerned about the realised value of X. In particular, the 
risk-adjusted view of the market today about the cash 
flow is represented by the a priori density p(x). By tomorrow,
however, the market will obtain additional noisy
information, based on which the market will update its 
view, represented in the form of an a posteriori density 
for X. This information consists of two components:
signal and noise. Although the signal-to-noise ratio is
generally unknown, and furthermore will change in time,
let us assume for simplicity that it is known to the market,
and that it is given by a constant $\sigma$. We also assume
for the moment that the market is efficient in the sense 
that all available information is used in the determina- 
tion of the price today. Hence there is no residual noise 
today. Likewise, the noise will vanish at time T when the 
value of X is revealed for sure. To keep the matter sim- 
ple, we model the noise term by the simplest Gaussian 
process that vanishes at time 0 and time T: the Brownian
bridge process $\{\beta_{tT}\}$ over the time interval [0, T].
Therefore, our choice for the information is

$$\xi_t = \sigma X t + \beta_{tT}. \tag{1}$$

The market filtration $\{\mathcal{F}_t\}$ is thus generated by the
knowledge $\{\xi_s\}_{0 \le s \le t}$ of the information process.

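If we write $P_{tT}$ for the discount function, and assume
that it is deterministic, then the price at time t of the
asset is determined by $S_t = P_{tT}\,\mathbb{E}[X\,|\,\mathcal{F}_t]$. A short
calculation then shows that the price process is given by

$$S_t = P_{tT}\,\frac{\int x\,p(x)\exp\left[\frac{T}{T-t}\left(\sigma x\,\xi_t - \tfrac{1}{2}\sigma^2 x^2 t\right)\right]dx}{\int p(x)\exp\left[\frac{T}{T-t}\left(\sigma x\,\xi_t - \tfrac{1}{2}\sigma^2 x^2 t\right)\right]dx}. \tag{2}$$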

We see therefore that in the BHM framework it is possible 
to derive the price process in a manner that replicates 
how price processes are generated in the first place via 
flow of information. In spite of the various simplifying 
assumptions, the resulting price process (2) is very rich
and possesses many desirable features. Perhaps the most 
notable from a practical point of view is the fact that the 
pricing and the hedging of elementary contingent claims 
are made easy. 
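
As an illustration, the information process and the
resulting price can be simulated numerically. The Python
sketch below is not taken from the BHM papers: it assumes
a binary cash flow, a constant discount factor, and
illustrative parameter values, and the function names
`brownian_bridge` and `bhm_price` are introduced here only
for the example.

```python
import numpy as np

def brownian_bridge(n_steps, T, rng):
    """Sample a standard Brownian bridge on [0, T] at n_steps + 1 grid points."""
    t = np.linspace(0.0, T, n_steps + 1)
    dW = rng.normal(0.0, np.sqrt(T / n_steps), n_steps)
    W = np.concatenate(([0.0], np.cumsum(dW)))
    # Subtracting (t/T) * W_T pins the path to zero at both ends.
    return t, W - (t / T) * W[-1]

def bhm_price(t, xi, sigma, T, values, probs, discount=1.0):
    """Discounted conditional mean of X given xi_t, for a discrete prior.

    Valid for t < T only; `discount` is a constant stand-in (an assumption)
    for the deterministic discount function.
    """
    t = np.asarray(t, dtype=float)[:, None]
    xi = np.asarray(xi, dtype=float)[:, None]
    x = np.asarray(values, dtype=float)[None, :]
    p = np.asarray(probs, dtype=float)[None, :]
    expo = T / (T - t) * (sigma * x * xi - 0.5 * sigma**2 * x**2 * t)
    expo -= expo.max(axis=1, keepdims=True)  # numerical stability
    w = p * np.exp(expo)
    return discount * (w * x).sum(axis=1) / w.sum(axis=1)

rng = np.random.default_rng(0)
T, sigma = 1.0, 0.2          # illustrative values
X = 1.0                      # realised cash flow on this sample path
t, beta = brownian_bridge(250, T, rng)
xi = sigma * X * t + beta    # information process: signal plus bridge noise
# Evaluate strictly before T, where the exponent is well defined.
price = bhm_price(t[:-1], xi[:-1], sigma, T, values=[0.0, 1.0], probs=[0.5, 0.5])
```

At time 0 the price equals the a priori mean, and it then
fluctuates within the range of possible cash flows as the
noisy information accumulates.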

3. Modelling the informed outsider. Within the 
BHM framework it is straightforward to model the infor- 
mation disparity seen in the market. Indeed, it has been 
shown in Brody et al. (2009) that if there is an informed 
trader in the market who has access not only to the market
information (1) but also to an additional information
source $\xi'_t = \sigma' X t + \beta'_{tT}$, then the informed trader can
exploit the information to generate statistical arbitrage. 
Here we shall modify the setup considered therein so as to 
replicate the trading strategy that we have developed by 
use of data taken from the internet, and calibrate some 
of the model parameters. In this manner we are able to 
test the performance of internet-based recommendation 
or rating engines from investment performances. 

Our modelling setup can be summarised as follows. 
We let X be a binary random variable taking the values 
{0, 1}, where 1 represents price moving up by a unit over
the period [0, T] and 0 represents price moving down by
a unit over the same period. At time 0 both the market
and the informed trader share the same information
about the value of X, represented by the a priori
probabilities $(p, 1-p)$. The informed trader, however, begins to
gather information from the internet, using text and data 
mining; whereas the general market gathers information 
through more widely accessible sources such as newspaper
articles and financial reports. We let $\xi_t$ of (1) represent
the market information process, and $\xi'_t = \sigma' X t + \beta'_{tT}$
represent the extra information gathered from the internet,
where the two noises $\{\beta_{tT}\}$ and $\{\beta'_{tT}\}$ may be
dependent, with correlation $\rho$. It is shown in Brody et al.
(2009) that in the case of multiple information sources 
the knowledge of the informed trader can be represented 
in the form of a single effective information process 

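$$\hat\xi_t = \hat\sigma X t + \hat\beta_{tT}, \tag{3}$$

where $\hat\sigma^2 = (\sigma^2 - 2\rho\sigma\sigma' + \sigma'^2)/(1 - \rho^2)$, and

$$\hat\beta_{tT} = \frac{\sigma - \rho\sigma'}{\hat\sigma(1 - \rho^2)}\,\beta_{tT} + \frac{\sigma' - \rho\sigma}{\hat\sigma(1 - \rho^2)}\,\beta'_{tT}.$$

Therefore, the effective signal-to-noise ratio for the
informed trader is given by $\hat\sigma$, which can be compared
against the market signal-to-noise ratio $\sigma$.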
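
The effective signal-to-noise ratio is simple to evaluate
numerically. The following Python sketch computes it as the
square root of (s^2 - 2 r s s' + s'^2)/(1 - r^2), where s and
s' are the market and internet signal-to-noise ratios and r
is the noise correlation. The function name
`effective_sigma` is introduced here for illustration, and
the inputs are the parameter values quoted in the caption
of figure 2.

```python
import math

def effective_sigma(sigma, sigma_prime, rho):
    """Effective signal-to-noise ratio from two sources with noise correlation rho."""
    if not -1.0 < rho < 1.0:
        raise ValueError("rho must lie strictly between -1 and 1")
    variance = (sigma**2 - 2.0 * rho * sigma * sigma_prime
                + sigma_prime**2) / (1.0 - rho**2)
    return math.sqrt(variance)

sigma, sigma_prime, rho = 0.2, 0.48, 0.1   # values from the figure 2 caption
sigma_hat = effective_sigma(sigma, sigma_prime, rho)
ratio = sigma_hat / sigma
print(f"sigma_hat = {sigma_hat:.2f}, ratio = {ratio:.2f}")
```

With these inputs the effective ratio rounds to 0.50, in
agreement with the figure caption.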

At time T/2 both the market and the informed trader 
have accumulated noisy information, based on which they 
evaluate the a posteriori probabilities, $p_m$ and $p_i$,
respectively, that X = 1. The trading strategy is as follows. If
the a posteriori probability is larger than the threshold
value $K_+$ then take a long position by the amount $X_t$; if
the a posteriori probability is smaller than the threshold
value $K_-$ then take a short position by the amount $X_t$,
where $X_t = \mathbb{E}_t[X]$ is the expectation of X using the
market filtration. The position is then held till time T, at
which point the profit or loss is made because the value of 
X is now revealed. Also at time T the next observation 
for the value of the random variable representing whether 
the asset price moves up or down over the interval [T, 2T] 
begins, and the same strategy is repeated over and over. 
Our model thus makes an implicit simplifying assump- 
tion that the magnitude of the stock volatility over the 
range [nT, (n + 1)T] is independent of the value of n. 
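
For a binary cash flow the a posteriori probability has a
closed form. As a sketch, suppose the information process
is the signal-to-noise ratio times X t plus a Brownian
bridge, and write $p_1$ for the a priori probability that
X = 1 (a label introduced here for definiteness). The
Bayes formula then gives the expression below, where
T/(T - t) = 2 at the decision time t = T/2.

```latex
\pi_t \;=\; \mathbb{Q}(X = 1 \mid \xi_t)
  \;=\; \frac{p_1 \exp\!\left[\frac{T}{T-t}\left(\sigma\,\xi_t - \tfrac{1}{2}\sigma^2 t\right)\right]}
             {p_1 \exp\!\left[\frac{T}{T-t}\left(\sigma\,\xi_t - \tfrac{1}{2}\sigma^2 t\right)\right] + (1 - p_1)},
\qquad
\pi_{T/2} \;=\; \frac{p_1\, e^{\,2\sigma\xi_{T/2} - \sigma^2 T/2}}
                     {p_1\, e^{\,2\sigma\xi_{T/2} - \sigma^2 T/2} + (1 - p_1)} .
```

The same expression applies to the informed trader once the
market quantities are replaced by the effective
signal-to-noise ratio and the effective information
process. Because X takes the values 0 and 1 only, the
market expectation of X coincides with the market
a posteriori probability.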

Both the market and the informed trader employ the 
same strategy, but the informed trader on average makes 
better estimates for the realised value of X, thus statis- 
tically obtaining a higher rate of return than the market. 
The risk-neutral valuation of the market position can be 
made straightforwardly, because the resulting cash flow 
is given by $(X - X_t)(\mathbf{1}\{X_t > K_+\} - \mathbf{1}\{X_t < K_-\})$. By
a change of measure technique introduced in Brody et 
al. (2007) one can show that the value of the strategy 
is given by a formula analogous to the Black-Scholes op- 
tion pricing formula. The valuation of the position of the 
informed trader is less obvious, although one can show 
that the expected P&L difference is positive, leading to 
a statistical arbitrage opportunity. 
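
A small Monte Carlo experiment illustrates the comparison.
The Python sketch below rests on assumptions that are not
stated in the text: the a priori probability is 1/2, the
thresholds 0.52 and 0.48 are hypothetical, and the informed
trader is taken to trade at the market price while choosing
the direction from the better-informed a posteriori
probability. The function names are introduced here for
the example only.

```python
import numpy as np

def posterior_up(xi, sigma, t, T, p1):
    """A posteriori probability that X = 1 given the information at time t."""
    log_odds = (np.log(p1 / (1.0 - p1))
                + T / (T - t) * (sigma * xi - 0.5 * sigma**2 * t))
    return 1.0 / (1.0 + np.exp(-log_odds))

def simulate_pnl(n_paths, sigma, sigma_p, rho, p1=0.5,
                 k_plus=0.52, k_minus=0.48, T=1.0, seed=0):
    """Average P&L per period for the market and for the informed trader."""
    rng = np.random.default_rng(seed)
    t = 0.5 * T
    X = (rng.random(n_paths) < p1).astype(float)
    # Bridge noises at time t: variance t(T - t)/T, correlation rho.
    sd = np.sqrt(t * (T - t) / T)
    z1 = rng.standard_normal(n_paths)
    z2 = rho * z1 + np.sqrt(1.0 - rho**2) * rng.standard_normal(n_paths)
    xi_mkt = sigma * X * t + sd * z1
    xi_net = sigma_p * X * t + sd * z2
    # Combine both sources into a single effective information process.
    sigma_hat = np.sqrt((sigma**2 - 2.0 * rho * sigma * sigma_p + sigma_p**2)
                        / (1.0 - rho**2))
    xi_hat = ((sigma - rho * sigma_p) * xi_mkt
              + (sigma_p - rho * sigma) * xi_net) / (sigma_hat * (1.0 - rho**2))
    x_mkt = posterior_up(xi_mkt, sigma, t, T, p1)      # market price of X
    x_inf = posterior_up(xi_hat, sigma_hat, t, T, p1)  # informed estimate

    def mean_pnl(signal):
        # Long above the upper threshold, short below the lower one.
        position = (signal > k_plus).astype(float) - (signal < k_minus).astype(float)
        return float(((X - x_mkt) * position).mean())

    return mean_pnl(x_mkt), mean_pnl(x_inf)

market_pnl, informed_pnl = simulate_pnl(20_000, sigma=0.2, sigma_p=0.48, rho=0.1)
print(f"market: {market_pnl:+.4f}  informed: {informed_pnl:+.4f}")
```

In this setup the market position has zero expected value
under the pricing measure, since the position depends only
on market information, whereas the informed position has a
positive expected value.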

4. Implementation and calibration. We have im- 
plemented the strategy using publicly available informa- 
tion sources. Specifically, we have gathered the totality 
of Japanese blog articles since 2006 and used them as 
our sole information source. In 2009, nearly 20 million 
Japanese blog articles appeared on the internet, making 
a daily average of around 50,000 articles. Each blog arti- 
cle is weighted by its relevance (e.g., page views). Those 
with insufficient weight are regarded as 'pure noise' and 
have been discarded from the analysis. 

Natural language processing (NLP) technology of Ya- 
hoo Japan Corporation and Yahoo Japan Research In- 
stitute has been applied to analyse company specific 
comments of the listed companies. The NLP classifies 
whether the comments are positive, neutral, or negative; 
this classification is then used to establish a sentiment
index for each company. Based on the sentiment index, a
trading strategy, analogous to the one described above,
is developed. The idea can be illustrated as follows. If 
many people write complimentary remarks about a new 
product released by a given company then it is likely that 
sales of the product will go up, leading to an increase in 
its share price. 
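
The aggregation step can be sketched in a few lines of
Python. The formula used for the index is not given here,
so the weighted average below (positive = +1, neutral = 0,
negative = -1, weighted by relevance) is purely a
hypothetical choice, as are the function name
`sentiment_index`, the cut-off `min_weight`, and the
company labels "A" and "B".

```python
from collections import defaultdict

# Hypothetical numerical scores for the three classes.
SCORE = {"positive": 1.0, "neutral": 0.0, "negative": -1.0}

def sentiment_index(comments, min_weight=1.0):
    """Relevance-weighted average sentiment per company.

    `comments` is an iterable of (company, label, weight) tuples; comments
    whose weight is below `min_weight` are discarded as pure noise.
    """
    weighted_sum = defaultdict(float)
    total_weight = defaultdict(float)
    for company, label, weight in comments:
        if weight < min_weight:
            continue
        weighted_sum[company] += weight * SCORE[label]
        total_weight[company] += weight
    # Skip companies with no usable weight to avoid division by zero.
    return {c: weighted_sum[c] / w for c, w in total_weight.items() if w > 0}

comments = [
    ("A", "positive", 120.0),
    ("A", "negative", 40.0),
    ("A", "positive", 0.5),    # below the cut-off, discarded
    ("B", "neutral", 10.0),
    ("B", "negative", 30.0),
]
index = sentiment_index(comments)
```

An index close to +1 or -1 would then indicate strongly
positive or strongly negative sentiment, which a threshold
rule can convert into a long or short position.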



P&L curves: market (black) and informed trader (green)
FIG. 1: Information-based trading. The blog sentiment data
is used to create a trading strategy for the relevant stocks. 
The performance (total return) that results from the strategy 
is shown in the solid black line. The dark blue dashed line 
represents the average stock prices; the light blue dashed line 
the Nikkei 225 Index. The 'learning' period for optimisation 
corresponds to the left of the vertical green line; the strategy 
is applied over the seven-month period starting in late
March 2009. Our information-based strategy yields over 40%
return for the seven-month period.



The strategy has been optimised using the data from 
2008 to early 2009 (for example, the choice of the thresh- 
old values K±), and applied for the seven month period 
from April 2009. Specifically, for the analysis presented 
here we have considered the 10 companies for which the
average numbers of blog comments are highest. In order
to obtain a conservative estimate for the ratio $\hat\sigma/\sigma$, and
also to reduce exposure to the market risk preference,
we have adopted a long-short strategy against the Nikkei
225 Index. The result of the strategy, as well as the average
stock prices of the active names and the Nikkei 225
Index, are shown in figure 1.

To estimate the ratio $\hat\sigma/\sigma$ we have simulated the
strategy numerically. Because we do not yet have a suitable
method of estimating the correlation $\rho$ between the noise
in the blog sentiments and the noise for market investors,
we can only give a range for this estimate. Fortunately,
however, we found that the range is relatively narrow:

$$2.4 < \frac{\hat\sigma}{\sigma} < 2.6.$$

The simulation results associated with the choice $\rho = 0.1$
are shown in figure 2.

FIG. 2: Simulation of the strategy. We have run the strategy
for the informed trader (black solid line) and that for the
market (blue dashed line), and taken the average over 5,000
sample paths. Parameter values are set as $\rho = 0.1$, $\sigma = 0.2$,
$\sigma' = 0.48$, and hence $\hat\sigma = 0.50$.

5. Discussion. We have successfully extracted a trading 
signal from the abundant data accessible on the internet. 
By applying the results to the stock market, we were able 
to assess the performance of the information extraction 
and provision engine. The results have identified a perhaps
surprising level of apparent inefficiency even in a highly
liquid equity market, indicating the degree of information 
inhomogeneity. 

It is of course well documented that asset prices in fi- 
nancial markets respond to the unravelling of information 
(e.g., Engle and Ng 1993; Andersen et al. 2007). Indeed, 
the fact that information filtering and communication
are key to understanding social sciences such as economics
has been recognised since Wiener (1954). Our
analysis differs sharply from previous work carried out in 
this area in that we explicitly identify the existence of in- 
formation disparity and derive an estimate for how much 
more the rate of information extraction could have been 
enhanced had the market been truly efficient. In contrast
with Google Finance, for instance, which provides a
postmortem analysis of the relation between large price 
moves and revelations of news items, our informed trader 
is able to exploit additional information sources to antic- 
ipate price moves. 

The analysis reported in this paper is naturally of in- 
terest to statistical arbitrage funds, because the strategy 
is orthogonal to conventional strategies that rely on, for 
example, microstructure. On the other hand, from the 
viewpoint of an internet search engine, one might envisage
a scenario whereby individual investors purchase
'signal' from information providers and make their own
investments. Such a model, however, is unlikely to be
sustainable, because if the signal is circulated broadly, it 
ceases to remain useful. As Wiener emphasises, concen- 
tration of useful information is intrinsically unstable due 
to the second law (Wiener 1954). The only way in which 
information can be spontaneously concentrated, at least 
momentarily, is via innovation. It is interesting therefore 
to reflect on the fact that in spite of the enhancement of 
technology in improving the method of information gath- 
ering and provision, whose purpose a priori goes against 
the second law, ultimately such developments can only 
result in enforcing the compliance with the second law. 
As a result, in the long run the second law will enhance 
the 'efficiency' of financial markets, but maybe also, para- 
doxically, the instability of financial markets, because in 
a noise-dominated market, the revelation of the true sig- 
nal has a significant impact. 